Enable computation of CUDA global kernels derivative in reverse mode #1059
Conversation
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
- Coverage   94.40%   94.38%    -0.02%
==========================================
  Files          55       55
  Lines        8430     8445       +15
==========================================
+ Hits         7958     7971       +13
- Misses        472      474        +2
clang-tidy made some suggestions
@@ -210,10 +235,34 @@
      printf("CladFunction is invalid\n");
      return static_cast<return_type_t<F>>(return_type_t<F>());
    }
    if (m_CUDAkernel) {
      printf("Use execute_kernel() for global CUDA kernels\n");
warning: do not call c-style vararg functions [cppcoreguidelines-pro-type-vararg]
printf("Use execute_kernel() for global CUDA kernels\n");
^
lib/Differentiator/DiffPlanner.cpp
    int numArgs = static_cast<int>(call->getNumArgs());
    if (numArgs > 4) {
      auto kernelArgIdx = numArgs - 1;
      auto cudaKernelFlag = new (C) CXXBoolLiteralExpr(
warning: initializing non-owner 'CXXBoolLiteralExpr *' with a newly created 'gsl::owner<>' [cppcoreguidelines-owning-memory]
auto cudaKernelFlag = new (C) CXXBoolLiteralExpr(
^
clang-tidy made some suggestions
I have added such a test in
I think Differentiator.h is difficult to test right now. Let's ignore this report as we know it's covered. I need to figure out how to run the CUDA tests, too.
Agreed.
You mean locally?
No, adding a bot that has a CUDA device to run the code we develop. That's an action item on me :)
    typename std::enable_if<EnablePadding, bool>::type = true>
    CUDA_HOST_DEVICE void
    execute_with_default_args(list<Rest...>, F f, list<fArgTypes...>, dim3 grid,
                              dim3 block, size_t shared_mem, cudaStream_t stream,
Can we wrap that in a CUDA-specific macro, say CUDA_ARGS, which gets expanded only in CUDA mode? This will allow you to wrap cudaLaunchKernel and friends with #ifdef __CUDACC__ and avoid duplicating the already overly complicated templates...
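A minimal sketch of this macro idea, assuming illustrative names CUDA_ARGS and CUDA_ARGS_FWD rather than Clad's actual API: the extra launch parameters are only spelled out when compiling as CUDA, so the host build keeps a single set of templates.

#ifdef __CUDACC__
#define CUDA_ARGS dim3 grid, dim3 block, size_t shared_mem, cudaStream_t stream,
#define CUDA_ARGS_FWD grid, block, shared_mem, stream,
#else
#define CUDA_ARGS
#define CUDA_ARGS_FWD
#endif

// The same declaration then serves both builds, e.g.:
// template <typename F, typename... Args>
// return_type_t<F> execute_with_default_args(CUDA_ARGS F f, Args&&... args);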
Alternatively, we can go even deeper and assume that in CUDA mode Args has size_t shared_mem, cudaStream_t stream and extract them before forwarding the rest.
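A hedged sketch of this alternative (the function name is illustrative, not the code that was merged): naming size_t shared_mem and cudaStream_t stream as explicit parameters peels them off the front of the pack, so only the kernel's own arguments are forwarded to the launch.

#include <cuda_runtime.h>
#include <utility>

// Assumes the caller passes shared_mem and stream right after the launch geometry.
template <typename F, typename... Rest>
void execute_cuda(F kernel, dim3 grid, dim3 block,
                  size_t shared_mem, cudaStream_t stream, Rest&&... rest) {
  // shared_mem and stream are consumed here; the remaining pack holds only
  // the kernel's own arguments.
  kernel<<<grid, block, shared_mem, stream>>>(std::forward<Rest>(rest)...);
}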
Done
Thank you for this pull-request. Overall, this looks great!
@@ -210,10 +235,34 @@ inline CUDA_HOST_DEVICE unsigned int GetLength(const char* code) {
      printf("CladFunction is invalid\n");
      return static_cast<return_type_t<F>>(return_type_t<F>());
    }
    if (m_CUDAkernel) {
      printf("Use execute_kernel() for global CUDA kernels\n");
We should probably assert-out if users use execute instead of execute_kernel for CUDA kernels. 'printf' seems to be too subtle for reporting such a big error. @vgvassilev What do you think?
Is that a programmatic error or a user error? We can use assert/abort if it is a programmer error.
It is a user error, basically. They should use the appropriate execute function depending on whether their function is a CUDA kernel or not.
    template <typename... Args, class FnType = CladFunctionType>
    typename std::enable_if<!std::is_same<FnType, NoFunction*>::value,
                            return_type_t<F>>::type
    execute_kernel(dim3 grid, dim3 block, size_t shared_mem,
Do you think it would be a good idea to provide overloads for execute_kernel that take default values for the shared_mem and stream args?
Sure, I can use the default values of 0 and nullptr
Done, but I needed to use if constexpr for this, which is C++17, so the CUDA tests cannot run with an older C++ version. @vgvassilev is that too big of a problem?
Why do we need constexpr for having default values for shared_mem and stream?
Since the signature of execute_kernel with default args is dim3, dim3, Args&&..., overload resolution picked that one even when the args should not be defaulted, instead of the correct dim3, dim3, size_t, cudaStream_t, Args&&.... Hence, I kept only the first signature and check whether a cudaStream_t arg is included in Args; otherwise I use the default values when calling execute_with_default_args.
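A hedged sketch of that dispatch (not Clad's actual code; launch_impl stands in for execute_with_default_args): a C++17 fold over the argument types detects a cudaStream_t in the pack, and if constexpr either forwards the caller's configuration or injects shared_mem = 0 and the default stream.

#include <cuda_runtime.h>
#include <type_traits>
#include <utility>

template <typename F, typename... Args>
void launch_impl(F kernel, dim3 grid, dim3 block, size_t shared_mem,
                 cudaStream_t stream, Args&&... args) {
  kernel<<<grid, block, shared_mem, stream>>>(std::forward<Args>(args)...);
}

template <typename F, typename... Args>
void execute_kernel_sketch(F kernel, dim3 grid, dim3 block, Args&&... args) {
  // Compile-time check: did the caller supply a cudaStream_t among the args?
  constexpr bool has_stream =
      (std::is_same<std::decay_t<Args>, cudaStream_t>::value || ...);
  if constexpr (has_stream) {
    // shared_mem and stream are assumed to lead the pack; forward unchanged.
    launch_impl(kernel, grid, block, std::forward<Args>(args)...);
  } else {
    // No stream supplied: fall back to shared_mem = 0 and the default stream.
    launch_impl(kernel, grid, block, /*shared_mem=*/0,
                static_cast<cudaStream_t>(nullptr), std::forward<Args>(args)...);
  }
}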
@vgvassilev I think it's better to not require the grid parameters to be evaluated at compile time. CUDA doesn't require that in general. So, we can't do this:
execute_kernel<grid, block, shared_mem, stream>(...)
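For contrast, a hedged usage sketch of the runtime-configured form; the kernel, sizes, and argument order passed to execute_kernel below are illustrative assumptions, not taken from the PR's tests.

#include "clad/Differentiator/Differentiator.h"

// Illustrative kernel; the adjoint parameters expected by the derived kernel
// depend on the user's signature.
__global__ void add(double* out, const double* in) { /* ... */ }

void run(double* out, const double* in, double* d_out, double* d_in, int n) {
  auto grad = clad::gradient(add);          // assumes the kernel support added here
  dim3 block(256);
  dim3 grid((n + block.x - 1) / block.x);   // computed at run time, as in plain CUDA
  grad.execute_kernel(grid, block, out, in, d_out, d_in);
}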
Looks good to me.
@kchristin22 Can you please rebase the PR on top of master?
This PR closes #1036. CUDA kernels can now be passed as an argument to clad::gradient, and the derived kernels can be executed successfully.
Details on the issues faced and the final approach can be found here: #1036 (comment).
The old PR's approach involved the use of CUDA API libraries to compile and load the computed kernel on the GPU.